Efficient Harvesting of Internet Audio for Resource-Scarce ASR

Authors

Marelie H. Davel

Charl Johannes van Heerden

Neil Kleynhans

Etienne Barnard

Abstract

Spoken recordings that have been transcribed for human reading (e.g. as captions for audiovisual material, or to provide alternative modes of access to recordings) are widely available in many languages. Such recordings and transcriptions have proven to be a valuable source of ASR data in well-resourced languages, but have not been exploited to a significant extent in under-resourced languages or dialects. Techniques used to harvest such data typically assume the availability of a fairly accurate ASR system, which is generally not available when working with resource-scarce languages. In this work, we define a process whereby an ASR corpus is bootstrapped using unmatched ASR models in conjunction with speech and approximate transcriptions sourced from the Internet. We introduce a new segmentation technique based on the use of a phone-internal garbage model, and demonstrate how this technique (combined with limited filtering) can be used to develop a large, high-quality corpus in an under-resourced dialect with minimal effort.

Similar resources

Efficient data selection for ASR

Automatic speech recognition (ASR) technology has matured over the past few decades and has made significant impacts in a variety of fields, from assistive technologies to commercial products. However, ASR system development is a resource intensive activity and requires language resources in the form of text annotated audio recordings and pronunciation dictionaries. Unfortunately, many language...

Determining High Level Dialog Structure without Requiring the Words

The potentially enormous audio resources now available to both organizations, and on the Internet, present a serious challenge to audio browsing technology. In this paper we outline a set of techniques that can be used to determine high level dialog structure without the requirement of resource intensive, accent dependent, automatic speech recognition (ASR) technology. Using syllable finding al...

The Scarce Drugs Allocation Indicators in Iran: A Fuzzy Delphi Method Based Consensus

Objective: Almost all countries are affected by a variety of drug-supply problems and spend a considerable amount of time and resources to address shortages. The current study aims to reach a consensus on the scarce drug allocation measures to improve the allocation process of scarce drugs in Iran by a population needs-based approach. Methods: To achieve the objective, two phases were co...

Automatic Construction of the Finnish Parliament Speech Corpus

Automatic speech recognition (ASR) systems require large amounts of transcribed speech data for training state-of-the-art deep neural network (DNN) acoustic models. Transcribed speech is a scarce and expensive resource, and ASR systems are prone to underperform in domains where there is not a lot of training data available. In this work, we open up a vast and previously unused resource of trans...